Advances in single-cell biology has enabled us to investigate isolated single cells at an unprecedented resolution. Current methods can measure subtle changes in cell state, revealing specialized minor subpopulations within a cell type. However, functional characterization of such subpopulations is still challenging. A promising approach to untangling functional characterization is spatial transcriptomics, where genomewide measurements of gene expression are enriched with spatial attributes. Such approaches allow us to determine the relative positions of cell subpopulations. Unfortunately, current spatial transcriptomic approaches suffer either from problems with sensitivity, are limited to a restricted panel of genes, or the spatial attributes are of low resolution.
In order to deconvolute the cell connectome, we sort multiplets of incompletely dissociated cells along with single cells from the same sample. The single cells provide a set of blueprints of possible subcell types, and subsequently we use swarm optimization approaches to infer the composition of multiplets. Using this method we can measure the composition of a very large number of multiplets onto a lattice consisting of subcell types of arbitrary complexity.
To accomplish these end-points an software package based on the R programming language has been deveolped named “spatial single cell RNA seq” or sp.scRNAseq. The package works in 3 seperate stages which are described below:
Below we will demonstrate the sp.scRNAseq packages functionality and, in parallel, describe and show results from the testing of the package on in silico and biological data.
To begin we use in silico data to evaluate if swarm optimization can be used as a deconvolution method for scRNAseq multuplets. We utilize random data generated from the negative binominal distribution to simulate scRNAseq singlets (single cells) with the .syntheticSinglets function. This function generates a data set of 100 singlet expression profiles with 2000 genes each and 10 different cell types. These cell types were combined into multuplets consisting of 2, 3, or 4 cells using the .syntheticMultuplets function. This dataset was then used used for testing the deconvolution process.
#load spatial scRNAseq testing package
library(sp.scRNAseqTesting)After the data was generated, tSNE was run to view the clusering of the singlets and classify them into cell types. The results from the tSNE, plotted below, indicate that each cell clusters well within it’s cell type and that no overlap exists between groups of cell types. This represents the “ideal” situation and gives us a easy starting point to understand if swarm optimization may be an appropriate method to use even with less “well behaved” datasets.
#load synthetic data
data(syntheticData)
#make spCounts object
cObj <- spCounts(counts = syntheticData, counts.ercc=matrix(), sampleType = "m.")#run unsupervised learning
uObj <- spUnsupervised(cObj, max=1000, max_iter = 1000)#plot unsupervised results
spPlot(uObj, type = "clusters")Figure 1. Unsupervised clustering of synthetic data singlets.
In the next, step swarm optimization was then evaluated as a method to deconvolute the multuplets, thus producing a cell-cell interaction map or “connectome”. To test this we generated multuplets based on the results of the post-tSNE classifications of the singlets. For example, to generate a multuplet representing the A1 and B1 group, one cell was picked from each group and the mean for each gene was calculated. Within the synthetic dataset multuplets were generated for all possible combinations of singlets using 2, 3, or 4 cells resulting in a total of 385 multuplets. We then picked 15 of these multuplets according to a specific number of connections desired between cell types (see table below). The swarm optimization was then tested using the spSwarm function . Due to the fact that we know which singlets were used to produce each multuplet and, therefore, which connections we expect, we can then measure the accuracy of the swarm optimization output.
#run pySwarm
result <- spSwarm(cObj, uObj, limit=10, maxiter=10, swarmsize=250, cores=5)syntheticDataTable| from/to | A1 | B1 | C1 | D1 | E1 | F1 | G1 | H1 | I1 | J1 |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 1 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 |
| B1 | 0 | 0 | 1 | 0 | 1 | 3 | 1 | 2 | 2 | 0 |
| C1 | 0 | 0 | 0 | 1 | 3 | 3 | 1 | 1 | 3 | 1 |
| D1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 |
| E1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 2 |
| F1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 3 | 1 |
| G1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| H1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| I1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Table 1. Expected connections in the synthetic multuplets.
Below we can see the network plot of the resulting connections detected after multuplet deconvloution. The singlets are clustered according to the previous tSNE results (small dots), the mean for each singlet classification group is represented by the large dots, and the detected interactions between groups (lines), and the number of interactions detected, or edge weights, are represented in the line thickness. Close inspection reveils that the detected connections mirror the expected connections in the table above.
#plot results
spPlot(result)Figure 2. Newtork graph of the deteced multuplets in the synthetic data.
To view the results in a easier manner, the heatmap below shows the real vs detected connections with the number of connections represented as the heatmap’s cell colors. The fact that all of the connections lay on a diagnal through the plot indicates that the swarm optimization method had a 100% sucess rate.
tabels <- calculateConnections(getData(result, "codedSwarm"), type="codedSwarm")[[2]]
real <- data.frame(tabels[,,'real'])
detected <- data.frame(tabels[,,'detected'])
d <- data.frame(
real = paste(
real[rep(1:nrow(real), real[["Freq"]]), ]$from,
real[rep(1:nrow(real), real[["Freq"]]), ]$to,
sep="-"
),
detected = paste(
detected[rep(1:nrow(detected), detected[["Freq"]]), ]$from,
detected[rep(1:nrow(detected), detected[["Freq"]]), ]$to,
sep="-"
)
)
table <- as.data.frame(table(d))
ggplot(table, aes(x=real, y=detected, fill=factor(Freq)))+
geom_tile(alpha=0.85)+
theme_few()+
scale_fill_manual(values = c("lightgrey", "#77B7E5", "#F2F7E8", "#F9BD7E", "#C0343E", "#450b18"))+
theme(
axis.text.x=element_text(angle=90),
legend.position="top"
)+
labs(
x="Real Connections",
y="Detected Connections",
fill="Number of Connections"
)Figure 3. Real vs deteced connections in the synthetic data multuplets.
Next we wanted to test swarm optimization using a dataset more representative of real data. To this end we used 131 fetal pancreas single cells that had been scRNA sequenced for this purpose. Below we can see the tSNE results from these singlets indicating that 8 cell types can be identified.
#load test data
data(expressionTestData)
#make spCounts object
cObj <- spCounts(counts = expressionTestData, counts.ercc=matrix(), sampleType = "m.")#run unsupervised learning
uObj <- spUnsupervised(cObj, max=2500, max_iter=10^5)#plot unsupervised results
spPlot(uObj, type = "clusters")Figure 4. Unsupervised clustering of the expression data singlets.
Next we used the single cell data and combined single cells into multuplets of 2, 3, or 4 cells using the [assembleTestData][] function. This resulted in 162 multuplets from which we then picked 15 to provide a specific expected connectome and ran swarm optimization. We can again start by looking at the table showing the expected connectome and the resulting network plot.
#run pySwarm
result2 <- spSwarm(cObj, uObj, limit=10, maxiter=10, swarmsize=250, cores=5)expressionTestTable| from/to | A1 | C1 | D1 | E1 | F1 | G1 | H1 |
|---|---|---|---|---|---|---|---|
| A1 | 1 | 2 | 2 | 0 | 3 | 2 | 1 |
| B1 | 0 | 2 | 0 | 1 | 3 | 1 | 2 |
| C1 | 0 | 1 | 2 | 1 | 3 | 2 | 4 |
| D1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 |
| E1 | 0 | 0 | 0 | 0 | 2 | 1 | 2 |
| F1 | 0 | 0 | 0 | 0 | 0 | 2 | 3 |
| G1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
Table 2. Expected connections in the expression data multuplets
#plot results
spPlot(result2)Figure 5. Network graph of the connections detected in the expression data multuplets.
Plotting the results in the heatmap format indicate that swarm optimization is, again, able to deconvolute the multuplets within this dataset with a 100% sucess rate.
tabels <- calculateConnections(getData(result2, "codedSwarm"), type="codedSwarm")[[2]]
real <- data.frame(tabels[,,'real'])
detected <- data.frame(tabels[,,'detected'])
d <- data.frame(
real = paste(
real[rep(1:nrow(real), real[["Freq"]]), ]$from,
real[rep(1:nrow(real), real[["Freq"]]), ]$to,
sep="-"
),
detected = paste(
detected[rep(1:nrow(detected), detected[["Freq"]]), ]$from,
detected[rep(1:nrow(detected), detected[["Freq"]]), ]$to,
sep="-"
)
)
table <- as.data.frame(table(d))
ggplot(table, aes(x=real, y=detected, fill=factor(Freq)))+
geom_tile(alpha=0.85)+
theme_few()+
scale_fill_manual(values = c("lightgrey", "#77B7E5", "#F2F7E8", "#F9BD7E", "#C0343E", "#450b18"))+
theme(
axis.text.x=element_text(angle=90),
legend.position="top"
)+
labs(
x="Real Connections",
y="Detected Connections",
fill="Number of Connections"
)Figure 6. Real vs detected connections in expression data multuplets.
Next, we used the sp.scRNAseq package on experimental data where both singlets and multuplets from fetal pancreatic tissue were physically isolated, via FACS sorting, previous to scRNAseq. The dataset includes 131 singlets and 69 multuplets.
scRNAseq was performed with ercc spike-ins where a constant amount of nonhuman control RNA is spiked into each well when the plate is prepared. In the plot below we can see the ercc spike-in levels for singlets and multuplets. Multiplets have on average ~30% the number of ERCC reads, indicating that each contain on average three cells. The fraction depends on various other things, such as the size of the cell and how much of the cell’s RNA that was successfully recovered.
#load test data
data(expData)
#expData is already a spCounts object
cObj <- expDataspPlot(cObj, type="ercc")Figure 7. Fraction of ERCC spike-ins in experimental data.
Next, we wanted to examine known cell identity marker expression within the singlets and multuplets. We expected that, whereas the singlets should exclusivley express these markers, the multuplets would show examples where multiple markers are expressed. Although some markers appear to be more promiscuous than others, we see that many of the markers follow this pattern.
markers <- c("INS", "THY1")
spPlot(cObj, type="markers", markers=markers)markers <- c("INS", "GCG")
spPlot(cObj, type="markers", markers=markers)markers <- c("FLT1", "THY1")
spPlot(cObj, type="markers", markers=markers)markers <- c("EPCAM", "THY1")
spPlot(cObj, type="markers", markers=markers)markers <- c("PROM1", "THY1")
spPlot(cObj, type="markers", markers=markers)markers <- c("PROM1", "INS")
spPlot(cObj, type="markers", markers=markers)Figure 8. Cellular marker expression in experimental data singlets and multuplets.
Next, we ran tSNE and plotted the results. This reveals the presence of 9 distinct cell types within the tissue.
#run unsupervised learning
uObj <- spUnsupervised(cObj, max=2000, max_iter=20000)#plot unsupervised results; cluster plot
spPlot(uObj, type="clusters")Figure 9. Unsupervised clustering of the experimental singlets.
To relate the resulting tSNE-based classifications to previously known cell identities we further plotted the tSNE results showing expression levels (log 2 normalized counts) of several well known cell surface markers. Note that the dot colors in the plot represent the expression levels.
#plot unsupervised results; markers plot
markers <- c("EPCAM", "THY1", "FLT1", "INS", "GCG", "NEUROG3", "PROM1", "SST", "PRSS1")
spPlot(uObj, type="markers", markers=markers)Figure 10. Cellular marker expression in post-tSNE classification groups.
Finally, we used the sp.scRNAseq package to analyze the cell-cell interactions within the multuplets. The results indicate the presence of a high number of interactions between cell types B1 and A1, as well as, B1 and D1. Examination of the markers plot indicates that this may represent a differation process as there is a inverse correlation between NEUROG3 and INS in these groups. Further connections can be viewed in the plot.
#run pySwarm
result3 <- spSwarm(cObj, uObj, maxiter=10, swarmsize=250, cores=8)#plot results
spPlot(result3)Figure 11. Network plot of experimental data multuplet connections.
syntheticDataTest <-
function (outPath = "data", cores = 5, n = 15, ngenes = 2000,
ncells = 100, cellTypes = 10, ...)
{
tmp <- .syntheticTestData(n, ngenes, ncells, cellTypes)
syntheticData <- tmp[[1]]
syntheticDataTest <- tmp[[2]]
syntheticDataUnsupervised <- tmp[[3]]
syntheticDataTable <- tmp[[4]]
syntheticDataCounts <- spCounts(syntheticDataTest, matrix(),
sampleType = "m.")
syntheticDataSwarm <- spSwarm(syntheticDataUnsupervised,
limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
save(syntheticData, file = paste(outPath, "syntheticData.rda",
sep = "/"), compress = "bzip2")
save(syntheticDataCounts, file = paste(outPath, "syntheticDataCounts.rda",
sep = "/"), compress = "bzip2")
save(syntheticDataUnsupervised, file = paste(outPath, "syntheticDataUnsupervised.rda",
sep = "/"), compress = "bzip2")
save(syntheticDataSwarm, file = paste(outPath, "syntheticDataSwarm.rda",
sep = "/"), compress = "bzip2")
save(syntheticDataTable, file = paste(outPath, "syntheticDataTable.rda",
sep = "/"), compress = "bzip2")
}(wraps .syntheticSinglets and .syntheticMultuplits into one function)
.syntheticTestData <-
function (n, ngenes, ncells, cellTypes, perplexity = 10)
{
tmp <- .syntheticMultuplets(ngenes, ncells, cellTypes, perplexity)
singlets <- tmp[[1]]
multuplets <- tmp[[2]]
uObj <- tmp[[3]]
testMultuplets <- multuplets[, sample(1:ncol(multuplets),
n, replace = FALSE)]
table <- calculateConnections(testMultuplets, type = "multuplets")
names <- c(paste("s", colnames(singlets), sep = "."), paste("m",
colnames(multuplets), sep = "."))
testNames <- c(paste("s", colnames(singlets), sep = "."),
paste("m", colnames(testMultuplets), sep = "."))
testSyntheticData <- cbind(singlets, testMultuplets)
syntheticData <- cbind(singlets, multuplets)
colnames(testSyntheticData) <- testNames
colnames(syntheticData) <- names
testSyntheticData <- as.matrix(testSyntheticData)
syntheticData <- as.matrix(syntheticData)
counts(uObj) <- testSyntheticData
counts.log(uObj) <- sp.scRNAseq:::.norm.log.counts(testSyntheticData)
sampleType(uObj) <- c(getData(uObj, "sampleType"), rep("Multuplet",
n))
return(list(syntheticData, testSyntheticData, uObj, table))
}.syntheticSinglets <-
function (ngenes, ncells, cellTypes)
{
for (i in 1:cellTypes) {
set.seed(i)
meanExprs <- 2^runif(ngenes, 0, 5)
counts <- matrix(rnbinom(ngenes * ncells, mu = meanExprs,
size = i), nrow = ngenes)
if (i == 1) {
singlets <- counts
}
else {
singlets <- cbind(singlets, counts)
}
}
colnames(singlets) <- paste(sort(rep(letters, ncells))[1:(cellTypes *
ncells)], 1:ncells, sep = "")
singlets <- as.data.frame(singlets)
return(singlets)
}.syntheticMultuplets <-
function (ngenes, ncells, cellTypes, perplexity)
{
singlets <- .syntheticSinglets(ngenes, ncells, cellTypes)
cObj <- spCounts(as.matrix(singlets), counts.ercc = matrix(),
sampleType = "[A-Z]")
uObj <- spUnsupervised(cObj, max = ngenes, max_iter = 1000,
perplexity = perplexity)
colnames(singlets) <- getData(uObj, "classification")
mean <- getData(uObj, "groupMeans")
combos <- t(matrix(rep(unique(colnames(singlets)), 2), ncol = 2))
for (y in 1:ncol(combos)) {
current <- combos[, y]
new <- data.frame(rowMeans(data.frame(mean[, colnames(mean) %in%
current[1]], mean[, colnames(mean) %in% current[2]])))
if (y == 1) {
multuplets <- new
names <- paste(current, collapse = "")
}
else {
multuplets <- cbind(multuplets, new)
names <- c(names, paste(current, collapse = ""))
}
}
combos <- combn(unique(colnames(singlets)), 2)
for (y in 1:ncol(combos)) {
current <- combos[, y]
new <- data.frame(rowMeans(mean[, colnames(mean) %in%
current]))
multuplets <- cbind(multuplets, new)
names <- c(names, paste(current, collapse = ""))
}
combos <- combn(unique(colnames(singlets)), 3)
for (u in 1:ncol(combos)) {
current <- combos[, u]
new <- data.frame(rowMeans(mean[, colnames(mean) %in%
current]))
multuplets <- cbind(multuplets, new)
names <- c(names, paste(current, collapse = ""))
}
combos <- combn(unique(colnames(singlets)), 4)
for (u in 1:ncol(combos)) {
current <- combos[, u]
new <- data.frame(rowMeans(mean[, colnames(mean) %in%
current]))
multuplets <- cbind(multuplets, new)
names <- c(names, paste(current, collapse = ""))
}
colnames(multuplets) <- names
return(list(singlets, multuplets, uObj))
}expressionDataTest <-
function (outPath = "data", cores = 5, n = 15, ...)
{
tmp <- .expressionTestData(n)
expressionTestData <- tmp[[1]]
testExpTestData <- tmp[[2]]
expressionTestUnsupervised <- tmp[[3]]
expressionTestTable <- tmp[[4]]
expressionTestDataCounts <- spCounts(testExpTestData, matrix(),
"m.")
expressionTestSwarm <- spSwarm(expressionTestUnsupervised,
limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
save(expressionTestData, file = paste(outPath, "expressionTestData.rda",
sep = "/"), compress = "bzip2")
save(expressionTestDataCounts, file = paste(outPath, "expressionTestDataCounts.rda",
sep = "/"), compress = "bzip2")
save(expressionTestUnsupervised, file = paste(outPath, "expressionTestUnsupervised.rda",
sep = "/"), compress = "bzip2")
save(expressionTestSwarm, file = paste(outPath, "expressionTestSwarm.rda",
sep = "/"), compress = "bzip2")
save(expressionTestTable, file = paste(outPath, "expressionTestTable.rda",
sep = "/"), compress = "bzip2")
}.expressionTestData <-
function (n)
{
counts <- getData(expData, "counts")
sampleType <- getData(expData, "sampleType")
sng <- counts[, sampleType == "Singlet"]
unsupervised <- spUnsupervised(expData, max = 2500, max_iter = 10^5)
classification <- getData(unsupervised, "classification")
means <- getData(unsupervised, "groupMeans")
dataset <- data.frame(row.names = 1:nrow(sng))
names <- c()
tmp <- .assemble(means, classification, dataset, names, 2,
type = "homo")
dataset <- tmp[[1]]
names <- tmp[[2]]
tmp <- .assemble(means, classification, dataset, names, 2,
type = "hetero")
dataset <- tmp[[1]]
names <- tmp[[2]]
tmp <- .assemble(means, classification, dataset, names, 3,
type = "hetero")
dataset <- tmp[[1]]
names <- tmp[[2]]
tmp <- .assemble(means, classification, dataset, names, 4,
type = "hetero")
dataset <- tmp[[1]]
names <- tmp[[2]]
colnames(dataset) <- paste("m.", names, sep = "")
colnames(sng) <- paste("s.", classification, sep = "")
testMultuplets <- dataset[, sample(1:ncol(dataset), n, replace = FALSE)]
table <- calculateConnections(testMultuplets, type = "multuplets")
expTestData <- as.matrix(cbind(sng, dataset))
testExpTestData <- as.matrix(cbind(sng, testMultuplets))
counts(unsupervised) <- testExpTestData
counts.log(unsupervised) <- sp.scRNAseq:::.norm.log.counts(testExpTestData)
sampleType(unsupervised) <- c(rep("Singlet", ncol(sng)),
rep("Multuplet", n))
return(list(expTestData, testExpTestData, unsupervised, table))
}.assemble <-
function (means, classification, dataset, names, x, type)
{
if (type == "hetero") {
comb <- combn(unique(classification), x)
}
else {
comb <- t(matrix(rep(unique(classification), 2), ncol = 2))
}
for (i in 1:ncol(comb)) {
currentMult <- rowMeans(means[, comb[, i]])
name <- paste(comb[, i], sep = "", collapse = "")
dataset <- cbind(dataset, currentMult)
names <- c(names, name)
}
return(list(dataset, names))
}experimentalDataTest <-
function (outPath = "data", cores = 5, ...)
{
data(expData)
cObj <- expData
experimentalDataUnsupervised <- spUnsupervised(cObj, max = 2000,
max_iter = 2 * 10^4)
experimentalDataSwarm <- spSwarm(experimentalDataUnsupervised,
limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
save(experimentalDataUnsupervised, file = paste(outPath,
"experimentalDataUnsupervised.rda", sep = "/"), compress = "bzip2")
save(experimentalDataSwarm, file = paste(outPath, "experimentalDataSwarm.rda",
sep = "/"), compress = "bzip2")
}